CHAPTER 10 


10.1 Sorting 

10.2 Searching 

10.3 Random Number Generation 
10.4 Cryptography 

10.5 Compression 

10.6 Hashing 

10.7 Summary 


In this chapter 


An algorithm, as discussed in previous chapters, is a step-by-step description of a means to 
solve a problem. As someone who is learning to program, what are the most important 
algorithms? That rather depends on how “important” is defined. Does it reflect commercial 
value? Number of times it is used? Pedagogical uses? Since there are many ways an algorithm 
can be important, this chapter deals with the most common algorithms discussed on 
programming web pages and in introductory computing texts. None of these methods requires a 
knowledge of advanced mathematics or data structures. 


SORTING 


Most people know what sorting is and can sort a small sequence of numbers in a few 
seconds. Each may have a distinct strategy for doing it, but few can 
explain to someone else how to sort an arbitrary set of numbers. They themselves may not 
know how they do it; they can simply tell when something is sorted, and have some process for 
sorting in mind. In short, the process of sorting is one of the simplest things that is hard to 
describe. 


Because sorting is so important in computer science, it has been studied at great length. But 
what is it? Sorting involves placing things in an order defined by a function that ranks them 
somehow. For numbers, ranking means using the numerical value. So: the sequence 1, 3, 2 is 
not in proper order, but 1, 2, 3 is in ascending (getting larger) order and 3, 2, 1 is in 
descending (getting smaller) order. Formally, a sequence s is in ascending order if $s_{i-1} \le s_i$ 
for all i ≥ 1. The act of sorting means arranging the values in a sequence so that this is true. 
This definition also makes it easy to decide whether a sequence is sorted. 
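
The test for sortedness can be written directly from the definition. The following is a minimal sketch; the function name is_ascending is chosen here for illustration and does not come from the text.

```python
def is_ascending(seq):
    """Return True if every element is >= the element before it."""
    for i in range(1, len(seq)):
        if seq[i - 1] > seq[i]:      # a pair that violates s[i-1] <= s[i]
            return False
    return True

print(is_ascending([1, 2, 3]))   # True
print(is_ascending([1, 3, 2]))   # False
```

An empty list and a one-element list both count as sorted, since there is no pair of neighbors that could be out of order.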
So how can a sequence be placed in sorted order? By using a sorting algorithm, of course. 
For all of the following discussion on sorting, assume that the problem is to sort into ascending 
order. 


Selection Sort 


Small sequences are easier to sort than longer ones, and may provide some insight into the 
process. The sequence: 


8 4 


is not sorted in ascending order, but testing this is easy and fixing it is trivial: simply swap the 
two values. The longer sequence: 


8 4 9 


is also not sorted but is more difficult to sort because it is longer and there are more 
combinations of the numbers that are unsorted. How can this sequence be placed in order? 
Here’s one idea: 


1. Find the smallest element in the list. 
2. Swap that element for the element at the beginning of the list. 
3. Find the smallest element in the rest of the list. 
4. Swap that element for the second element in the list. 
... and so on until the list is sorted. 


This is called the selection sort algorithm, because at each stage it selects the smallest of 
the unsorted items in the list and places it where it belongs. Consider the following list: 


[12, 18, 5, 21, 9] 
  0   1  2   3  4    - index 

The smallest element in this list is 5, at index 2. Swap element 2 for element 0: 

[5, 18, 12, 21, 9] 

The sorted portion of the list is now only the element at location 0. For 
the remainder of the elements, repeat the process of finding the smallest element and placing it 
at the beginning of the unsorted list (element 1). That means swapping 9 for 18, element 4 for 
element 1: 

[5, 9, 12, 21, 18] 

Repeating, it turns out that element 2, value 12, is now the smallest, and is in the correct 
place. 

[5, 9, 12, 21, 18] 

Now the value 18 is smallest and should be placed at location 3. 

[5, 9, 12, 18, 21] 

Now the sort is complete. When only one element remains it must be in the correct place. 


Finding the smallest element in a list involves three things. First, begin with the initial 
element and assume that it is the smallest. Identify it using its index imin. Next, check the value 
of all successive elements in the list (from imin to the end of the list) against the value at imin. 
Finally, in the case where one of the successive values at index k is smaller than the one at 
index imin, set imin to k to indicate where a new smallest value was found. In simple, 
imprecise English, scan all of the elements above imin and remember the location of the 
smallest one. Presuming that the list to be sorted is named data, the code for finding the 
smallest element from imin to the end of the list is: 


    for i in range(imin, len(data)): 
        if data[i] < data[imin]: 
            imin = i 

This code does work, but it modifies imin, which is used to determine the loop bounds, 
within the loop itself. This can be confusing to some, and is bad form generally. It is better to 
code this loop using a separate variable, istart, for the index where the scan begins: 

    imin = istart 
    for i in range(istart, len(data)): 
        if data[i] < data[imin]: 
            imin = i 

What happens after this is to swap the smallest value found for the one at location istart. In 
most programming languages this would take three statements, which would look something 
like this: 

    temp = data[imin] 
    data[imin] = data[istart] 
    data[istart] = temp 

One of the joys of Python is that this swap can be performed using a different, some would 
say prettier, syntax: 

    (data[istart], data[imin]) = (data[imin], data[istart]) 

This is the core of the algorithm, and needs to be done for every starting position istart; that 
is, from 0 up to but not including len(data)-1, because the last element needs no pass of its own. 
This is another for loop, of course, within which this code is placed. That outer loop would be: 

    for istart in range(0, len(data)-1): 


This is all that is needed for the sort. Writing it as a function, it looks like this: 

    def selection(data): 
        for istart in range(0, len(data)-1): 
            imin = istart 
            for i in range(istart, len(data)): 
                if data[i] < data[imin]: 
                    imin = i 
            (data[istart], data[imin]) = (data[imin], data[istart]) 

This sorting method appears to be natural to humans. It is the one most often described by 
students when asked how they sort numbers. It is not the fastest in many cases, but does a small 
number of swaps. The function above performs one swap per pass, so never more than 
len(data)-1 in total, and if the data is already sorted each of those swaps simply exchanges an 
element with itself. When looking at 
algorithms it is common to define a worst case and a best case, and to define performance not 
in seconds but in terms of one of the operations performed. In that way the nature of the 
computer, whether it is fast or slow, does not affect the analysis. For sorting it is common to 
select the operation to be used as a basis for comparison to be the compare operation: 
data[i] < data[imin]. How many of these are done? 


The best case for the selection sort occurs when the list is already sorted. In that case it will 
perform on the order of $N^2$ comparisons (about $N^2/2$), where N = len(data). This is the same 
number of comparisons needed for the worst case, in which the list is in reverse order. At least it is 
consistent. However, it minimizes the number of times swaps occur, and if swapping is 
expensive then this could be the sorting method to choose. 
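
The comparison and swap counts can be checked by instrumenting the sort. The sketch below is a variant written for this purpose, not the function given earlier: it starts the inner scan at istart+1 (so no element is compared with itself) and counts a swap only when two different positions are exchanged. The name selection_counts is an assumption made for illustration.

```python
def selection_counts(data):
    """Selection sort on a copy of data.

    Returns (sorted_list, comparisons, swaps).
    """
    data = list(data)                 # work on a copy
    comparisons = 0
    swaps = 0
    for istart in range(len(data) - 1):
        imin = istart
        for i in range(istart + 1, len(data)):
            comparisons += 1
            if data[i] < data[imin]:
                imin = i
        if imin != istart:            # count only swaps that move something
            data[istart], data[imin] = data[imin], data[istart]
            swaps += 1
    return data, comparisons, swaps

for sample in ([1, 2, 3, 4, 5], [5, 4, 3, 2, 1], [12, 18, 5, 21, 9]):
    print(sample, selection_counts(sample))
```

For each five-element list the comparison count is the same, 4+3+2+1 = 10, whatever the initial order; only the number of swaps changes.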


Selection sort is unstable. If there are repeated values in the data, then they will of course 
end up together in the final, sorted list. However, if a sort is stable they will remain in the 
same order they were originally. Selection sort, like many others, does not guarantee this. It 
seems as if this is a minor thing, but it does matter in some cases. Consider a list of names 
that is displayed, in order of some sort of score, on a web page. Names for tie scores should 
always be in the same order on the page, so that if the page is refreshed or a link is followed 
the page looks the same. 
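
A small experiment makes the difference visible. This is a sketch under stated assumptions: the player names and scores are made up, and selection_by_key is a hypothetical variant of selection sort that compares records by a key function. Python's built-in sorted() is documented to be stable, so it serves as the comparison.

```python
def selection_by_key(records, key):
    """Selection sort of records in place, comparing key(record)."""
    for istart in range(len(records) - 1):
        imin = istart
        for i in range(istart + 1, len(records)):
            if key(records[i]) < key(records[imin]):
                imin = i
        records[istart], records[imin] = records[imin], records[istart]

def score(record):
    return record[1]

players = [("Ann", 90), ("Bob", 90), ("Cy", 70)]   # made-up data

by_selection = list(players)
selection_by_key(by_selection, score)
print(by_selection)                  # Bob now precedes Ann
print(sorted(players, key=score))    # Ann still precedes Bob
```

The long-distance swap that moves Cy to the front is what carries Ann past Bob, and that is the source of the instability.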


It should be said here that generally there is no best sorting method. The properties of such a 
method would be: 


1. Fast. Selection sort is $N^2$ in terms of comparisons. The best one can normally expect 
from any sort would be N*log(N) in the worst case. 


2. Does not need extra space. This means that the array can be sorted in place, with 
perhaps a temporary variable for performing swaps. 

3. Performs no more than N swaps in the worst case. 

4. Adaptive. The method detects when it is finished instead of looping through 
unproductive iterations. If, for example, such a method is given an already sorted list, it 
will finish in a single pass through the data. 


5. Stable. 
No method has all of these characteristics. 


Merge Sort 


If there were a “best” sorting algorithm then this would be the place to 
describe it. As there is not, perhaps the best thing to do would be to look at an algorithm that is 
quite different from the selection sort, and that has properties that it does not have. The method 
named merge sort fits that description nicely: it is an N*log(N) sort, it does need extra space, 
and it uses more than N swaps but it is stable. 


Merge sort is an example of a divide and conquer style of algorithm, in which a problem is 
repeatedly broken up into sub-problems, often using recursion, until they are small enough to 
solve; the solutions are combined to solve the larger problem. The idea behind merge sort is to 
break the data into parts that can be sorted trivially, then combine those parts knowing that they 
are sorted. Using the sample data from the selection sort example, the first step in the merge 
sort is to split the data into two parts. There are 5 elements in this list, and the middle 
element would be at 5//2, or 2, so the two parts are: 


[12, 18] [5, 21, 9] 

Splitting again, the first set has 2 elements, so it is split at index 2//2, or 1; the second set 
has 3 elements, so it is also split at 3//2, or 1: 

[12] [18] [5] [21, 9] 

The final split breaks the data into individual components: 

[12] [18] [5] [21] [9] 

The splitting is done in such a way that the original locations are remembered. This happens 
in the recursive solution, but could be done in other ways. One way to visualize this is as a 
tree structure: 


            [12, 18, 5, 21, 9] 
        [12, 18]        [5, 21, 9] 
      [12]   [18]     [5]     [21, 9] 
                            [21]   [9] 


This completes the divide portion of the divide and conquer. Now that the individual 
elements are available, it is easy to sort them, as pairs. On the lower right the pair [21] and [9] 
is out of order, so they must be swapped with each other. Now they are sorted. On the next 
level upwards, looking from left to right, the elements are sorted, although most are single 
elements: 

            [12, 18, 5, 21, 9] 
        [12, 18]        [5, 21, 9] 
      [12]   [18]     [5]     [9, 21] 


Moving up again, [12] and [18] are combined to make [12,18], a sorted pair. On the right, 
the singleton [5] is merged with the pair [9,21] by looking at the beginning of each list and 
copying the smallest element of the pair into a new list: 


Step   List 1   List 2     Merged list 
1      [5]      [9, 21]    [5]            5 is smaller than 9, copy 5 
2      []       [9, 21]    [5, 9]         first list is empty, copy 9 
3      []       [21]       [5, 9, 21]     first list is empty, copy 21 
4      []       []         Final list: [5, 9, 21] 

The result is: 

            [12, 18, 5, 21, 9] 
        [12, 18]        [5, 9, 21] 


At each stage, the lists contain more elements and they are sorted internally, smallest 
element at the beginning. Combining a pair of these is simply a matter of looking at the element 
at the beginning of each and copying the smallest one to the result until the lists are empty. The 
next, and final, merge in this set of data would be: 


Step   List 1     List 2        Merged list 
1      [12, 18]   [5, 9, 21]    [5]                   5 is smaller than 12, copy 5 
2      [12, 18]   [9, 21]       [5, 9]                9 is smaller than 12, copy 9 
3      [12, 18]   [21]          [5, 9, 12]            12 is smaller than 21, copy 12 
4      [18]       [21]          [5, 9, 12, 18]        18 is smaller than 21, copy 18 
5      []         [21]          [5, 9, 12, 18, 21]    First list is empty, copy 21 


The final list is [5, 9, 12, 18, 21] which is sorted, as promised. 
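
The merge step traced in the tables can be isolated as a function of its own. This is a minimal sketch that builds a new list instead of overwriting the original; the name merge is chosen here for illustration.

```python
def merge(left, right):
    """Merge two lists that are each already in ascending order."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:       # <= keeps equal items in original order
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    result.extend(left[i:])           # at most one of these two slices
    result.extend(right[j:])          # still holds any elements
    return result

print(merge([5], [9, 21]))            # [5, 9, 21]
print(merge([12, 18], [5, 9, 21]))    # [5, 9, 12, 18, 21]
```

Taking from the left list when the two front elements are equal is what makes a merge sort built on this step stable.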


Once the data has been split into individual components, the merge stage creates sorted 
pairs, the next merge creates sets of 4 sorted numbers, the next 8, and so on, doubling each time 
until they are all sorted. A logical way to write the program is to use recursion, where each 
recursive call splits the data in two more parts until there is only one element. The lowest 
level of recursion combines the individuals into sorted pairs, and returns to the next level 
where the pairs are combined into fours, then eights, and so on until at the highest level the list 
is completely sorted. Written as a recursive function this is: 


    data = [12, 18, 5, 21, 9] 

    def mergesort(data): 
        n = len(data)             # For this call there are n elements 
                                  # to be sorted 
        if n <= 1:                # Divide the data into two parts 
            return                # unless n<=1, which means sorting is 
                                  # complete 
        middle = n//2             # Index of the element in the middle 
        lower = data[:middle]     # Lower indices, or the left sublist 
        upper = data[middle:]     # Larger indices, or the right sublist 
        mergesort(lower)          # Sort the left sublist 
        mergesort(upper)          # Sort the right sublist 

        # There are now two sorted sublists of length about n//2. 
        # Merge them back into data, which has length n. 
        (i, j, k) = (0, 0, 0) 
        while i < len(lower) and j < len(upper):   # One sublist may be shorter ... 
            if lower[i] <= upper[j]:   # If the element at index i of the 
                data[k] = lower[i]     # left list is smaller, copy it to the result 
                i = i+1 
            else: 
                data[k] = upper[j]     # Otherwise copy the element at index j 
                j = j+1                # of the right sublist to the result 
            k = k+1                    # Result gets longer by 1 element 

        for i in range(i, len(lower)):   # If the left list was longer, copy 
            data[k] = lower[i]           # the remaining items to the result 
            k = k+1 
        for j in range(j, len(upper)):   # If the right list was longer, copy 
            data[k] = upper[j]           # the remaining items to the result 
            k = k+1 

The merge sort is not as obvious as was selection sort, but is faster in most cases. It has 
another interesting application: it can be used to sort files. If a file contains, for example, a 


billion data samples that need to be sorted it is unlikely that they can be read into memory and 
sorted with a selection sort. How then to sort them? 


SEARCHING 


Searching is the act of determining whether some specific data item appears in a list and, if 
so, at which index. It seems like an odd thing to do; what can be done knowing this 
information? It is especially useful when multiple lists hold different data concerning the same 
items. An employee, as one example, might have their various data saved as a name list, an 
employee ID list, phone number, office number, home address, and so on. The same index 
gives information of the same individual for each list. Thus, search the employee ID list for 
18762; if that index is 32, then the employee’s name can be found at name[32]. 


Of course Python has built-in operations on a list that will do this: 

    if 18762 in employeeID:              # Is this ID a member of the list? 
        k = employeeID.index(18762)      # What is the index of 18762? 

A reason to examine searching algorithms is that not all languages possess these specific 
features and not all programs are written in Python. Another is that someone had to implement 
the operations for the Python system itself, and they had to know how. Did they do a good job? 


Are the built-in operations as fast as ones that a programmer could code for themselves? This 
will be discovered using an experiment. 


Timings 


Any section of code in Python requires some amount of time to execute. The specific amount 
depends on many things: the computer being used, the 
Python compiler, the specific statements, the data, and random events such as what other 
programs are executing on that computer at the same time. However, if it is important to know 
whether a section of code is faster than another, there are timing functions that can provide a 
pretty good idea. The time module includes a function named perf_counter() that returns the 
reading, in seconds, of a high-resolution clock; only the difference between two readings is 
meaningful. Older versions of Python offered time.clock() for this purpose, but it behaved 
differently on Windows and Linux and was removed in Python 3.8. Be sure to look it up. 


Timing a section of code is done by calling time.perf_counter() before and after the code executes 
and subtracting the two times. For example, timing a search of a list using the in operator could 
be done this way: 

    import time 

    list = [19872, 87656, 10982, 18756, 56344, 29765, 12856, 12534, 
            88768, 90012] 

    # time.clock() was removed in Python 3.8; perf_counter() replaces it. 
    t0 = time.perf_counter() 
    if 90012 in list: 
        found = True 
    t1 = time.perf_counter() 
    print("Time was ", t1-t0) 

This prints the message: 
Time was 2.062843880463903e-05 
That’s a pretty small time, as is to be expected. When run again the result was 3.07232e-06; 
running again gets 2.194514766e-06 and again 7.9002531e-06. These numbers are all small 
but very different. Since that is true it is better to time many executions of the code and divide 
by the number of times it ran: 

    # time.clock() was removed in Python 3.8; perf_counter() replaces it. 
    t0 = time.perf_counter() 
    for i in range(0, 10000): 
        if 90012 in list: 
            found = True 
    t1 = time.perf_counter() 
    print("Time was ", (t1-t0)/10000) 

This yields more consistent results: 5.5284e-07, 5.5951e-07, and 5.415e-07 in three 


different trials. Averaging the result of multiple trials gives even better results, because 
spurious times on any one run will be averaged out. 
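
The repeat-and-average procedure can be wrapped in a small helper. The sketch below is one possible arrangement, not the author's code: the names average_time and search_with_in are assumptions, and it uses time.perf_counter(), which is available in current versions of Python. The standard library's timeit module packages the same idea.

```python
import time

def average_time(func, repeats=10000):
    """Average seconds per call of func() over `repeats` calls."""
    t0 = time.perf_counter()
    for _ in range(repeats):
        func()
    t1 = time.perf_counter()
    return (t1 - t0) / repeats

values = [19872, 87656, 10982, 18756, 56344, 29765, 12856, 12534,
          88768, 90012]

def search_with_in():
    return 90012 in values

# Three independent trials, then the mean of the trials.
trials = [average_time(search_with_in) for _ in range(3)]
print("Trials:", trials)
print("Mean of trials:", sum(trials) / len(trials))
```

The measured time includes the cost of the function call itself, so the numbers are most useful for comparing alternatives timed the same way.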


Linear Search 

Consider the list that was used in the timing example: 

    list = [19872, 87656, 10982, 18756, 56344, 29765, 12856, 12534, 
            88768, 90012] 


Finding whether the target number 90012 appears in this list is a matter of looking at each 
element to see if it is equal to the target. If so the answer is “yes” and, by the way, the index at 
which it was found is also known. This can be done in a basic for statement: 

    target = 90012 
    index = -1 
    for i in range(0, len(list)): 
        if list[i] == target: 
            index = i 
            break 

    # If the value of index is >= 0 then it was found. 


This algorithm looks at each element asking “Is this equal to the target?” When/if the target 
is located, the loop ends and the answer is known. If the target is not a member of the list, then 
the algorithm has to examine all members of the list to determine that fact. Thus, the worst case 
is when the element is not in the list, and it requires N comparisons to find that out. If the 
element is a part of the list then, on the average, it will require N/2 comparisons to find it. It 
could be the first element, or the last, or any of the others, which averages out to N/2. 


If the list is in sorted order then the loop can be exited as soon as it is known whether the 
element is in the list or not. That is, as soon as the target is smaller than the element it is being 
compared against in the list, it is clear that it can’t be a member of the list, and the loop can be 
exited. This normally speeds up the execution, but the penalty is that the list has to be sorted, 
and the time needed to do this (only once, of course) has to be taken into account. 
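
The early exit for sorted data can be sketched as follows. The function name linear_search_sorted is chosen for illustration, and the list is the one from the timing example after sorting.

```python
def linear_search_sorted(sorted_list, target):
    """Linear search of an ascending list; returns an index or -1."""
    for i, value in enumerate(sorted_list):
        if value == target:
            return i
        if value > target:        # every later value is larger still
            break
    return -1

values = sorted([19872, 87656, 10982, 18756, 56344, 29765, 12856,
                 12534, 88768, 90012])

print(linear_search_sorted(values, 18756))   # 3
print(linear_search_sorted(values, 15000))   # -1, after looking at 4 elements
```

The search for 15000 stops at 18756, the first value larger than the target, instead of examining all ten elements.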


Binary Search 


If the list has been sorted then there is a faster way to search for an element. The list can be 
divided into two parts by looking at the value in the middle of the list and comparing it to the 
target. If the target is smaller than the middle element, then it would have to be in the lower 
indices (left), otherwise it would have to exist in a higher valued index (to the right). What this 
means for performance is that the search area is cut in half each time a comparison is done. 


This idea seems simple, but is actually difficult to get right in an implementation. At 
conferences where many PhDs in computer science are presenting papers, it has been found 
that fewer than 10% of the participants can code a binary search that works the first time. The 
terminal conditions are tricky: in particular, how can it be determined that the target is not in 
the list? OK, so the details are crucial. At the beginning there is a list, and its length is known. 
The index of the middle element is known too, and the list is sorted. So: find the index of the 
middle element: 

    istart = 0 
    iend = len(list)-1            # Index of the last element 
    m = (iend+istart)//2 

If the target is in the list, is it at a smaller index than m (i.e., is list[m]>target)? 

    if list[m]>target: 

If so, don’t bother looking at any index bigger than m. In other words, the largest index to 
look at would be m-1: 

        iend = m-1 

If the target is in the list, is it at a larger index (i.e., is list[m]<target)? If so, don’t look at 
any locations with an index less than m; in other words: 

    elif list[m]<target: 
        istart = m+1 

If target == list[m] then it has been found and the algorithm terminates. 

    else: 
        return m 

This code has to be repeated until the target has been found, or it has been determined that it 
is not in the list. The loop condition is critical. The loop continues so long as istart <= iend so 
that if the final step finds the target in the list, then it will return the index. If the loop exits 
without finding the element, then the function returns None. The final code, as a function, is: 

    def search(list, target): 
        istart = 0 
        iend = len(list)-1            # Index of the last element 
        while istart<=iend: 
            m = (iend+istart)//2 
            if list[m]>target: 
                iend = m-1 
            elif list[m]<target: 
                istart = m+1 
            else: 
                return m 
        return None 


The speed of the binary search depends on the fact that it is searching a randomly accessible 
data set like a Python list or a Java array, and not a file. It will take on the order of log(n) 
probes into the list to find what it is looking for or to determine that it is not there. 


Timing the binary search gave an execution time of 3.305e-06 seconds, still slower than the 
built-in operation. 
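
The log(n) behavior can be observed by counting probes. The sketch below is an instrumented variant written for illustration (the name binary_search_probes and the test data are assumptions); it sets iend to the index of the last element so that every probe stays inside the list.

```python
import math

def binary_search_probes(sorted_list, target):
    """Binary search that also reports how many elements it examined.

    Returns (index or None, probes).
    """
    istart = 0
    iend = len(sorted_list) - 1
    probes = 0
    while istart <= iend:
        m = (istart + iend) // 2
        probes += 1
        if sorted_list[m] > target:
            iend = m - 1
        elif sorted_list[m] < target:
            istart = m + 1
        else:
            return m, probes
    return None, probes

numbers = list(range(0, 2000, 2))     # 1000 sorted even numbers
for target in (0, 998, 1998, 1999):
    index, probes = binary_search_probes(numbers, target)
    print(target, index, probes)
print("log2(1000) is about", round(math.log2(1000), 2))
```

For a list of 1000 elements no search needs more than 10 probes, whether or not the target is present, compared with up to 1000 for a linear search.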


RANDOM NUMBER GENERATION 


Python offers a random number module named random that offers a broad collection of 
random number generation facilities. How is it possible to generate a random number using 
software? Shouldn’t a computer program execute consistently and always produce the same 
answer each time? Yes, it should. The resolution of this apparent problem lies in the definition 
of random. 


First, randomness is defined only for collections of events or numbers. One number, or even 
a small collection, can’t be said to be random. Randomness reflects the lack of a pattern, and 
only one or two events don’t really display a pattern. Randomness is more of a statistical 
property of a sequence, and is not necessarily related strictly to unpredictability. After all, if a 
computer program can generate random numbers, then it should be possible to predict the next 
one it will generate. 


A random number generator (RNG) on a computer is referred to as pseudo-random; it is not 
truly random, but exhibits properties of randomness. These properties can be tested 
statistically. A typical RNG returns a floating point number between 0.0 and 1.0. This value 
can easily be transformed into a random number, either real or integer, in any desired range. A 
die roll is an integer between 1 and 6 inclusive. An RNG function named rand01() can be 
converted into a die roll as: 


    int(rand01()*6 + 1) 


If the numbers generated by rand01() are random, then it should produce die rolls that each 
have a probability of 1/6. If not then there is a bias. 
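
A quick check for bias is to roll the die many times and compare the observed frequencies with 1/6. In the sketch below Python's random.random(), which returns a value in the range [0.0, 1.0), stands in for rand01(); the function names and the seed value are arbitrary choices made for illustration.

```python
import random

def die_roll():
    """Map a random float in [0.0, 1.0) to an integer from 1 to 6."""
    return int(random.random() * 6 + 1)

def face_frequencies(rolls):
    """Roll the die `rolls` times; return the fraction seen of each face."""
    counts = [0] * 6
    for _ in range(rolls):
        counts[die_roll() - 1] += 1
    return [c / rolls for c in counts]

random.seed(12345)                    # fixed seed so runs are repeatable
for face, freq in enumerate(face_frequencies(60000), start=1):
    print(face, round(freq, 4))
```

Each frequency should come out close to 0.1667. A face that is consistently well above or below that value over many runs would point to a bias.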


If a coin is flipped many times and the sequence HTHTHTHTHTHTHT results, the 
probability of H or T (heads or tails) is 0.5, or 50%, which is what would be expected. If a 
sequence has the correct percentages for each outcome, then it passes the frequency test. Yet 
this sequence is probably not random because of the obvious pattern in the results. The 
frequency test is not enough. 


A second test would consider pairs in the sequence and compare the probability of 
occurrence of each pair against the theoretical. In the coin toss there are four possible pairs: 
HH, HT, TH, and TT. Each pair should appear with equal probability, and yet the string above 
shows only HT instances. It is not random. A standard suite of randomness tests called Diehard 
includes a more complex version of this test, involving groups of five elements in the 
sequence, each one having a theoretical probability of 1 in 120. This kind of test can be called 
the serial test or overlapping permutations. 
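
The pair test on the coin sequence can be sketched in a few lines. This is a simplified illustration, not the Diehard test itself; the function name pair_counts and its step parameter are assumptions. With step=2 the sequence is cut into consecutive, non-overlapping pairs; with step=1 the pairs overlap.

```python
from collections import Counter

def pair_counts(sequence, step=2):
    """Count pairs of adjacent outcomes in a string such as 'HTHT'.

    step=2 uses consecutive, non-overlapping pairs; step=1 overlaps them.
    """
    return Counter(sequence[i:i + 2]
                   for i in range(0, len(sequence) - 1, step))

flips = "HTHTHTHTHTHTHT"
print(pair_counts(flips))             # only HT appears
print(pair_counts(flips, step=1))     # HT and TH, but never HH or TT
```

Either way, two or three of the four possible pairs never occur, so the sequence fails the test even though it passes the frequency test.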


Parker, J. R. (2021). Python. an introduction to programming. Mercury Learning & Information. 
Created from dogus-ebooks on 2023-09-12 18:40:34. 


Copyright © 2021. Mercury Learning & Information. All rights reserved. 


A third test involves using the RNG to generate poker hands. The probability of specific 
hands is well-known, and any consistent variation from these probabilities would imply a flaw 
in the RNG. This is the poker test. Any complex random game could be used, and the Diehard 
suite uses the game of craps. 


There are many other tests that could be applied, and all are based on generating complex 
situations and comparing the theoretical distribution of properties generated against what the 
RNG creates. So, now that there are ways of testing an RNG, can one be written in Python and 
tested? 


Linear Congruential Method 


Pseudo-random number generators basically shuffle the bits around in a number in complex 
and non-repeating ways; at least, they don’t repeat for a large number of trials. A historically 
common method for doing this is to calculate a value that is bound to be larger than the place 
where it is to be stored and keep only the remainder each time. The value of this remainder is 
pseudo-random under certain conditions. A linear equation can be used and is fast to calculate: 


$$X_{j+1} = (aX_j + b) \bmod m \qquad (10.1)$$


where $X_j$ is the previous random number in the sequence and $X_{j+1}$ is the next one. The value 
of m should be quite large, and a prime number is a traditional choice. Many computers have used 
a 32-bit integer size, and as it happens $2^{31}-1$ is a good value for m (= 2147483647). Python 
integers can be as large as desired, so larger values could be used. Keeping the result to 32 bits is 
accomplished using an and operation and masking the result with a 32-bit constant: 
0xFFFFFFFF. Masking in this way is the same as taking the remainder mod $2^{32}$, which is the 
modulus that the code in this section actually uses. 


Values for a and b are more flexible, but large values are a good idea, and too many factors 
can cause problems. One good set of values is a=69069 and b=362437. This method uses a 
previous value to calculate the next one, so an initial value is required. This is called the seed, 
and it must be possible for a user/programmer to be able to set this seed value to whatever they 
choose. If not then the RNG will generate the same set of values each time it is used. That’s 
actually a good thing for debugging, because when tracking down a problem, it is important 
that the program behave consistently. 

The basic RNG described above would be: 

    _xseed = 76951 

    def irand01(): 
        global _xseed 
        _xseed = (69069*_xseed + 362437) & 0xFFFFFFFF 
        return _xseed 


This function returns a number between 0 and 4294967295 (0xFFFFFFFF), and resets the seed 
(_xseed, a global) each time. It’s a good start, but what is wanted is a function that returns a 
number between 0 and 1; so, a second function does this simply by dividing the above result by 
4294967296 (0x100000000), one more than the largest value irand01() can return, so that the 
result is always less than 1: 

    def rand01(): 
        return irand01() / 0x100000000 


A function that can set the seed is needed too: 

    def setseed(x): 
        global _xseed 
        _xseed = x 

A commonly used function in the Python random package is randrange(a, b), which returns 
a random integer from a up to but not including b; the related randint(a, b) includes b as well. 
The code for a die has already been written, and so the math is known. Using the tools just 
written, a version that can return either end point, as a die roll must, is coded as: 

    def randrange(n1, n2): 
        x = int(rand01()*(n2-n1+1)) + n1 
        return x 
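
The pieces above can be gathered into one self-contained sketch. Packaging them as a class is a choice made here for illustration, not the arrangement used in the text; the class name LCG and its method names are assumptions. It uses the same constants, masks to 32 bits, and divides by 2**32 so that rand01() stays below 1.

```python
class LCG:
    """Linear congruential generator: x = (69069*x + 362437) mod 2**32."""

    def __init__(self, seed=76951):
        self.x = seed

    def next_int(self):
        """Advance the generator; returns an integer from 0 to 2**32 - 1."""
        self.x = (69069 * self.x + 362437) & 0xFFFFFFFF
        return self.x

    def rand01(self):
        """Return a float in the range [0.0, 1.0)."""
        return self.next_int() / 0x100000000

    def randrange(self, n1, n2):
        """Return an integer from n1 to n2, both end points included."""
        return int(self.rand01() * (n2 - n1 + 1)) + n1

first = LCG(76951)
second = LCG(76951)
rolls_a = [first.randrange(1, 6) for _ in range(10)]
rolls_b = [second.randrange(1, 6) for _ in range(10)]
print(rolls_a)
print(rolls_a == rolls_b)     # True: the same seed gives the same sequence
```

Keeping the seed inside an object lets several independent generators exist at once, which a single global seed does not allow.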
How can a random number generator be made to generate a different set of numbers every 
time a program starts using it? Simply by setting the seed to a number that is hard to predict. 
Such a number is found in the low bits (milliseconds and microseconds) of the system clock. It 
is very hard to predict what these will be. So randomizing the RNG can be accomplished like 
this: 

    import time 

    def randomize(): 
        global _xseed 
        # Convert to microseconds first so the fast-changing low bits are kept 
        _xseed = int(time.time()*1000000) & 0xFFFFFFFF 

The time.time() function returns the number of seconds, as a floating point number, since a 
fixed date in the past, called the epoch. This date is usually January 1st, 1970, midnight. 


Other methods for generating random numbers exist and are commonly used. Python’s 
random module uses the Mersenne Twister algorithm, which is often seen as a default in 
programming languages but is a trifle slow. Blum-Blum-Shub resembles the linear congruential 
method but uses the relation $x_{j+1} = x_j^2 \bmod m$ where m is the product of two prime numbers. Dozens 
more methods exist. There are also practical methods for generating true random numbers, and 
these are based on specific hardware that captures a truly random process such as radioactive 
decay, the photoelectric effect, or random electromagnetic noise. 
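
The Blum-Blum-Shub relation is short enough to sketch directly. The primes 11 and 23 and the seed 3 below are toy values picked only so the arithmetic can be followed by hand; the standard formulation asks for two very large primes that each leave a remainder of 3 when divided by 4, and a seed that shares no factor with m. The function name is an assumption.

```python
def blum_blum_shub(seed, count, p=11, q=23):
    """Return `count` successive values of x = x*x mod (p*q).

    p and q are toy primes for illustration only.
    """
    m = p * q
    x = seed
    values = []
    for _ in range(count):
        x = (x * x) % m
        values.append(x)
    return values

print(blum_blum_shub(3, 5))   # [9, 81, 236, 36, 31]
```

In practice only a few low bits of each value are used as output, not the whole number.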

Finally, there are web sites that will offer random numbers and sequences on request. 


Random.org will serve up true random numbers, for example, and there are dozens of other 
such sites. The time needed to connect to a server and download a random number is 
considerable, so they should be used knowing the tradeoff of time for random number quality. 


CRYPTOGRAPHY 


Cryptography involves sending messages that only certain intended people can receive and 
understand. This involves codes and ciphers. A code substitutes one string for a longer 
message; there is a code book in which the code strings are associated with their relevant 
message. So, the string “A76” could mean “retreat 100 meters.” Code books had to be changed 
regularly because eventually one would fall into the hands of someone who was not supposed 
to have one. 


A cipher is an algorithm that converts one string of characters into another one of generally 
the same length. It can operate on bits, on characters, or on blocks of characters. A cipher does 
not have a code book but does have a key, which is a string of numbers or characters, that the 
algorithm uses to transform the original string (called the plaintext) into the encrypted string 
(called the ciphertext). The ciphertext can be transmitted safely because it cannot be 
understood without the key. 


Cryptography has become much more important in the last 30 years or so. It’s not just that 
the world is an uncertain place. It is more that people wish to share private information across 
the Internet. If a purchase is made with a credit card, then the card number should be encrypted 
before sending it to the seller. Access to certain sites that have valuable services or 
information requires a password. Installing new software requires an access key. These are all 
examples where encryption is required. 


It should be mentioned that the secure transfer of information depends on operational 
security as well as on encryption. Someone with a password can access all services and data 
associated with that password, so keys and passwords must be protected. This aspect is 
beyond the ability of a programmer to control, and is often the way security systems are 
broken. 


There is some terminology that needs to be understood. A symmetric key system uses one 
key to encode and the same key to decode. Asymmetric systems like public key systems use 
one key to encrypt the message, a key that anyone can know, and a second, private key that only 
the recipient knows and is used to decrypt. A block cipher applies a key to a collection (block) 
of data, often a size of 64, 128, or 256 bits at a time. A stream cipher is usually a symmetric 
key cipher that encrypts a plain text character with a character from the key. It’s also called a 
state cipher because the encryption of the next character depends on what has happened 
before. 

Knowing a little about encryption is important, but it is also important to understand that it is 
a very complex and highly mathematical subject, and requires a significant amount of study to 
become an expert. 


One-Time Pad 


Having just said how complex the field of cryptography is, the first algorithm to be 
examined is, in fact, rather old and perfectly secure, if difficult to use in practice. Suppose 
person A wishes to send person B the message “Meet you at nine pm at location alpha.” 
Encoding this requires a sequence of random characters at least as long as the message. In 
actual use, this cipher often used pages from books as keys, books that were easily accessible 
by both parties. In this case the following text is used as the key: “it was the best of times it 
was the worst of times.” The encryption process, known to both, and in fact not really a secret, 
is to apply the exclusive OR operation to corresponding characters in the message and the key 
to produce the ciphertext: 


The exclusive OR operation is a bit-by-bit logical operator that is 0 if the two bits are equal 
and is 1 otherwise. It is applied to the numerical representations of the characters. This is quite 
handy because it is very fast and can easily be accomplished using simple hardware. Consider 
the first character in the message “m.” The first character in the key is “i.” The ASCII codes 
are the numbers 109 and 105 respectively, or in binary: 


01101101   109   “m” 
01101001   105   “i” 
00000100     4   Exclusive OR 


One interesting observation here is that different characters can be encrypted to the same 
ciphertext byte: in this message “i” and “n” both encrypt to 26, because they are combined 
with the key characters “s” and “t” respectively. The ciphertext is then transmitted to B and 
is decoded in exactly the same way that it was encoded: apply the exclusive OR between the 
ciphertext and the same key (symmetric key): 


  4  17  18  21  10  27  29   4  22  11  26  26  10  22  25   8   Ciphertext ints 
  i   t   w   a   s   t   h   e   b   e   s   t   o   f   t   i   Key 
105 116 119  97 115 116 104 101  98 101 115 116 111 102 116 105   Key ints 
109 101 101 116 121 111 117  97 116 110 105 110 101 112 109  97   XOR 
  m   e   e   t   y   o   u   a   t   n   i   n   e   p   m   a   Decrypted 


The Python code that can do the basic encryption is: 


pt = "meetyouatninepmatlocationalpha" 
key = "itwasthebestoftimesitwastheworstoftimes" 
ct = "" 

for i in range(0, len(pt)): 
    v = ord(pt[i]) ^ ord(key[i])    # XOR the message byte with the key byte 
    print(v) 
    ct = ct + chr(v) 

print(ct) 

The exclusive-OR operator is “^”, and the expression ord(pt[i]) ^ ord(key[i]) performs the 
XOR on the message and the key bytes, as numbers. Doing it again with the same key gets the 
message back. 
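
The round trip can be checked with a short sketch. The helper name xor_with_key is an assumption made for this illustration and is not part of the original program; it applies the same XOR to every character and is used once to encrypt and once to decrypt.

```python
def xor_with_key(text, key):
    """XOR each character of text with the matching character of key."""
    if len(key) < len(text):
        raise ValueError("the key must be at least as long as the message")
    return "".join(chr(ord(t) ^ ord(k)) for t, k in zip(text, key))

pt = "meetyouatninepmatlocationalpha"
key = "itwasthebestoftimesitwastheworstoftimes"

ct = xor_with_key(pt, key)          # encrypt
recovered = xor_with_key(ct, key)   # decrypt with the same key

print([ord(ch) for ch in ct[:4]])   # first few ciphertext values
print(recovered == pt)              # True when the round trip works
```

Because the operation is its own inverse, no separate decryption routine is needed.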

The reason that this is called a one-time pad is that the key can only be used once, otherwise 
the cipher is not secure. The security lies in the randomness of the key, and reusing it reduces 
the randomness. Eventually if the same key is used often enough, an observer, someone who 
can intercept all of the messages, can extract the pattern and determine the key. So in practice 
the keys were written on pads of paper and, once used, were destroyed. Keeping the pads 
synchronized between the sender and receiver can be a problem, especially if there are many 
of each. Hence, although the system is secure, it is not used very often. 


Public Key Encryption (RSA) 


A public key system is commonly used for secure communication across computer networks, 
and involves one key for encryption and another for decryption. There are many variations on 
the basic idea, some being much too complex to discuss in a few pages, but the RSA algorithm 
is relatively simple, quite popular, and very secure. It is named for its inventors Rivest, 
Shamir, and Adleman. 

The mathematical idea that underlies RSA is that one can find three very large integers e, d, 
and n such that 


$(m^e)^d \bmod n = m$ 


for any m, and that even knowing e and n or even m, it can be extremely difficult to find d. 
The values d and e are the keys, and m is the message. 


So, encrypting a message would work as follows: A sends message m to B using B’s 
publicly known encryption key e: 


$c = m^e \bmod n$ 


The value of c is the ciphertext and can be transmitted to B. When B receives the message, it 
is decrypted using their private key d: 


$m = c^d \bmod n$ 

where n > m. This works because of the original assertion that $(m^e)^d \bmod n = m$. The 
success of this method depends on a few other things: can $c^d \bmod n$ be calculated quickly 
enough for large numbers (e.g., 500 bits), and can the numbers d, e, and n be found to make 
this work? 
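
On the first question, a power modulo n does not require computing the full power first. The sketch below uses repeated squaring, a standard technique that is not taken from the text; the function name power_mod and the small numbers used to exercise it are assumptions for illustration only. Python's built-in pow(base, exponent, n) does the same job.

```python
def power_mod(base, exponent, n):
    """Compute (base ** exponent) % n by repeated squaring."""
    result = 1
    base = base % n
    while exponent > 0:
        if exponent % 2 == 1:            # low bit set: multiply this power in
            result = (result * base) % n
        base = (base * base) % n         # square for the next bit
        exponent = exponent // 2
    return result

# Small illustrative values, far below realistic key sizes.
print(power_mod(68, 17, 6059))
print(power_mod(68, 17, 6059) == pow(68, 17, 6059))
```

The loop runs once per bit of the exponent, so even an exponent of several hundred bits needs only a few hundred multiplications.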


The first step in determining the keys is to select two very large prime numbers p and q. Let 
n = p*q. A large number in this context has hundreds of bits, but that creates a cumbersome 
example, so smaller numbers will be used in this discussion. 


Now calculate $\varphi(n) = (p-1)(q-1)$ and find an integer e so that e and $\varphi(n)$ are 
co-prime; that is, the greatest common divisor of e and $\varphi(n)$ is 1. 


Let $d = e^{-1} \bmod \varphi(n)$, so that $d \cdot e \bmod \varphi(n) = 1$. This can be found 
using a search, which may be infeasible due to the size of the numbers: 


for i in range(e, n): 
    if (i*e) % phi == 1:    # phi holds the value of (p-1)*(q-1) 
        d = i 
        break 


or a mathematical process that uses Euler’s theorem can give the answer faster, and code has 
been provided for this on the accompanying disc. 
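
As one possible faster route, the sketch below finds d with the extended Euclidean algorithm. This is a substitute technique chosen for illustration, not the code on the disc, and the function name mod_inverse is an assumption. It is exercised with the small primes 73 and 83 and the exponent 17.

```python
def mod_inverse(e, phi):
    """Return d with (d * e) % phi == 1, using the extended Euclidean algorithm."""
    old_r, r = e, phi
    old_s, s = 1, 0
    while r != 0:
        quotient = old_r // r
        old_r, r = r, old_r - quotient * r
        old_s, s = s, old_s - quotient * s
    if old_r != 1:                       # old_r is gcd(e, phi)
        raise ValueError("e and phi are not co-prime, so no inverse exists")
    return old_s % phi

p, q = 73, 83
n = p * q
phi = (p - 1) * (q - 1)
e = 17
d = mod_inverse(e, phi)

print(n, phi, d)
print((d * e) % phi)     # 1 when d is a valid private exponent
```

The number of loop iterations grows with the number of digits in the inputs, not with their magnitude, which is why this approach remains usable for very large keys.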


Example: Encrypt the Message “Depart at Dawn” Using RSA 
The first step is to determine some keys to use and to distribute the public key. Using the 
prime numbers 73 and 83 (far too small for a real situation) the determination of the keys is: 
n is 6059 and $\varphi(n)$ is 5904. 


e is 17, chosen because it is prime and does not divide $\varphi(n)$. Now find d such that 
$d \cdot e \bmod \varphi(n) = 1$. Searching for it is practical for numbers this size and one gets: 


d = 3473 


So the public key is ( 17 , 6059 ) and the private key is ( 3473 ). 

The message is 14 characters long, and would be 112 bits; n is only 13 bits long, and the 
message has to be shorter than this. In this instance the message can be sent one character at a 
time, but this is generally poor practice. Normally larger blocks of data are encrypted at one 
time. The plaintext string is converted into integers using ord(), and each one is encrypted 
using the formula: 


$c = m^e \bmod n$ 

An example would be: 


message = "Depart at dawn" 
imessage = ()                        # The message as integers 
cmessage = ()                        # The encrypted message 


for i in range(0, len(message)): 
    m = ord(message[i]) 
    imessage = imessage + (m,) 
    c = (m**e) % n                   # Encrypt one character 
    cmessage = cmessage + (c,) 

Now the message consists of 14 blocks of 1 character each. It can be transmitted to the 
recipient, who is normally named B or Bob, in this form. The sender, named A or Alice, had 
access to the public key only, which is all that is needed to encrypt the message. It cannot be 
decrypted using the public key; that requires the private key d, given 
$d \cdot e \equiv 1 \pmod{\varphi(n)}$. 


Bob receives the ciphertext message, which in this case is: 

(4652, 3518, 4274, 5770, 1663, 344, 2498, 5770, 344, 2498, 2144, 5770, 1725, 4601) 
He takes each block and decrypts it using: 


$m = c^d \bmod n$ 


The Python code for this could be: 

dmessage = () 
for i in range(0, len(cmessage)): 
    c = cmessage[i] 
    m = (c ** d) % n                 # Decrypt one block 
    dmessage = dmessage + (m,) 


The resulting decrypted message is: 

(68, 101, 112, 97, 114, 116, 32, 97, 116, 32, 100, 97, 119, 110) 
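
The whole exchange can be reproduced in a few lines. This is a sketch using the example's small keys; it swaps the expression (m**e) % n for Python's built-in three-argument pow(), which computes the same value without building the huge intermediate power. The variable name recovered is introduced here for illustration.

```python
n, e, d = 6059, 17, 3473            # keys from the example (far too small for real use)

message = "Depart at dawn"

# Alice encrypts each character with the public key (e, n).
cmessage = tuple(pow(ord(ch), e, n) for ch in message)

# Bob decrypts each block with the private key d.
dmessage = tuple(pow(c, d, n) for c in cmessage)
recovered = "".join(chr(m) for m in dmessage)

print(cmessage)
print(dmessage)
print(recovered)
```

Note that equal characters produce equal ciphertext blocks, which is the weakness of encrypting one character at a time.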

The decrypted tuple holds the character codes of the original message. Notice that because 
only one block per character was encrypted, the effect is that of a substitution cipher, in which 
each letter has been replaced by another. This is very easy to break by noting patterns of 
letters and frequencies of letters in the language; the letter “e” is usually the most commonly 
used letter in an English message. That is why a message is normally encrypted as blocks of 
characters. It is highly unlikely that a large block would be repeated exactly, and if it were it 
would be difficult to guess what it was anyway. 


COMPRESSION 


A little arithmetic will start this discussion. The song “Blackbird” by The Beatles is almost 
exactly 4 minutes long. This is 240 seconds, and if it was converted into digital form, it would 
be sampled at a rate of 44,100 samples each second. This means that the song has 240*44100 
= 10.6 million samples. But wait—it’s stereo, so double that to 21.2 million samples. A typical 
sample is 16 bits, so this works out to 42.4 million bytes; 42 megabytes! The MP3 file for this 
song is typically 1.9 megabytes. How is that possible? By using a compression algorithm. 
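
The arithmetic is easy to reproduce. The sketch below only restates the figures in the paragraph above; the variable names are chosen for this illustration.

```python
seconds = 4 * 60            # a 4-minute song
sample_rate = 44100         # samples per second, per channel
channels = 2                # stereo
bytes_per_sample = 2        # 16 bits

samples = seconds * sample_rate * channels
size = samples * bytes_per_sample

print(samples)              # total samples
print(size)                 # uncompressed size in bytes
print(round(1.9e6 / size * 100, 1))   # MP3 size as a percentage of the raw size
```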


Data compression is all about ways to take, for example, 100 bytes of information and turn it 
into 10 bytes while losing none of the essential message. Of course, compressed data is 
incomprehensible just to look at and must be decompressed in order for it to be used. Data is 
often compressed before storing it in a file to reduce its footprint on the storage device, or 
before transmitting it along a communications channel to take better advantage of limited 
bandwidth. 


The question of how a string of data bytes can be made shorter while losing no important 
information remains, and a simple example may be in order. Consider a cartoon image. Such 
images have a relatively small number of distinct but vivid colors, usually fewer than 10, and 
the color variation within any region is small. The example image in Figure 10.1 is in PNG form 
and is 23.2 Kbytes in size at 400x456 (= 182400) pixels. As raw data it would be a little over 
182 Kbytes in size, 547 Kbytes if RGB color was used. 




Figure 10.1 
Sample image for compression. 


A simple compression technique that will work in this case is called run-length encoding. 
In its simplest form data bytes are preceded by a count indicating how many repetitions of that 
value were encountered in the data. So if there was a section of data: 


11000000222121200000022222 

This would be encoded as 


21 60 32 11 12 11 12 60 52 

where each pair is a count followed by a value: 

21   two ones 
60   six zeros 
32   three twos 
11   one one 
12   one two 
11   one one 
12   one two 
60   six zeros 
52   five twos 


In this case the original data required 26 bytes and the compressed data required 18 bytes. 
The new data takes 69% of the space that the original does. This is not a huge saving, but is 
probably worth the effort. It does depend heavily on the nature of the data. 
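
The idea can be tried on the digit string directly. The following is a minimal sketch; the function names rle_encode and rle_decode are assumptions made for this illustration, and the pairs are kept as Python tuples instead of being written to a file.

```python
def rle_encode(data):
    """Return a list of (count, value) pairs, one per run in data."""
    pairs = []
    for value in data:
        if pairs and pairs[-1][1] == value:
            # Same value as the current run: extend it.
            pairs[-1] = (pairs[-1][0] + 1, value)
        else:
            # A new value starts a new run of length 1.
            pairs.append((1, value))
    return pairs

def rle_decode(pairs):
    """Rebuild the original string from (count, value) pairs."""
    return "".join(value * count for count, value in pairs)

data = "11000000222121200000022222"
pairs = rle_encode(data)

print(pairs)
print(len(data), 2 * len(pairs))    # original size, encoded size (2 bytes per pair)
print(rle_decode(pairs) == data)
```

The alternating section in the middle of the data costs two bytes per original byte, which shows why the method only pays off when runs are long.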

Consider the image of Figure 10.1. The color areas are uniform and rather large, so this 
image would be an ideal candidate for run-length encoding. When writing the program, it is 
important to use a binary file and convert the value and count into unsigned bytes before 
writing them to the file. This is a new data type called an unsigned byte that was not discussed 
in Chapter 8, and has the code “B.” So, writing the count and value could be done like this: 

    s = pack("BB", n, v[1]) 
    f.write(s) 

The entire program that will run-length encode the image will read the image file and collect 
identical pixels, counting them as they are collected, until a change in pixel value occurs. Then 
the (count, value) pair is written to the file. Also the pair will be written if 255 pixels have 
been collected, since that is the biggest number that can be counted in 8 bits. The result is a 
binary file of pairs of numbers (count, value) that represent the pixels in the image. As there 
are only two colors, two values are enough; the program stores the green component of each 
pixel, which is 255 for white and 210 for green. In general there can be 256 distinct values. 
The encoding program looks like this: 

from struct import * 
import Glib 

def emit(v, n, f):                      # Write a pair of bytes (count, value) 
    s = pack("BB", n, v[1])             # 'B' is unsigned byte. 
    f.write(s) 

Green = (123, 210, 0)                   # The object color 
White = (255, 255, 255)                 # The background color 

Glib.startdraw(400, 456) 
b1 = Glib.loadImage("bl.png")           # Read the image 
outf = open("bl.txt", "wb")             # Open the output file 
Glib.image(b1, 0, 0)                    # Display the image 

count = 0 
value = Glib.getpixel(b1, 0, 0)         # Read a pixel value initially 
for j in range(0, 456): 
    for i in range(0, 400):             # For every image pixel 
        if count == 255:                # Largest possible count. 
            emit(value, count, outf)    # Write (255, value) 
            count = 0                   # Reset the count 
        c = Glib.getpixel(b1, i, j)     # Get the next pixel value 
        if c == value:                  # Same as before? 
            count = count + 1           # Yes. Increment the count 
        else:                           # No, different 
            emit(value, count, outf)    # Write the count and value 
            count = 1                   # Reset the count 
            value = c                   # and the value 
if count > 0:                   # After the loop ends are there pixels to write? 
    emit(value, count, outf)            # Yes, so do it. 

outf.close() 
Glib.enddraw() 

The decoding program reads pairs of unsigned bytes from the binary file, and creates pixels. 
A pair (12, 0) would be 12 white pixels, for instance. A pair (12, 1) could be 12 pixels of 
some other color, and this program writes the pixels so it decides what color that will be. It 
will read pairs and draw pixels, into an image of 400 columns and 456 rows, until all are 
accounted for. A program that does this (not the only one possible) is: 

from struct import * 
import Glib 

Glib.startdraw(400, 456) 
inf = open("bl.txt", "rb")          # Open the run length encoded file 

i = 0 
j = 0 
cols = 400                          # The size is known, but could be a 
rows = 456                          # part of the data file. 
while True: 
    s = inf.read(2)                 # Read a (count, value) pair (bytes) 
    if len(s) < 2:                  # End of file? 
        break                       # Yes. Exit loop, stop drawing. 

    c, v = unpack("BB", s)          # Convert to integers 
    if v == 255:                    # Background pixel? (White) 
        Glib.fill(255, 255, 255) 
    else:                           # Object pixel? (Green) 
        Glib.fill(123, 210, 0) 

    for k in range(0, c):           # Draw the pixels to the screen. 
        if i >= cols:               # At end of a row, move to the next row 
            i = 0 
            j = j + 1 
        Glib.point(i, j)            # Draw as a 'point' 
        i = i + 1                   # Next pixel (increase X) 
    if j >= rows:                   # Last row 
        break 

inf.close() 
Glib.enddraw() 
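
The same packing scheme can be tried without the graphics library. The sketch below works on a plain list of byte values instead of image pixels; the function names rle_bytes and unrle_bytes and the sample row are assumptions made for this illustration. It shows how a run longer than 255 is split into several pairs.

```python
from struct import pack, unpack

def rle_bytes(values):
    """Run-length encode a sequence of byte values (0-255) into packed pairs."""
    out = bytearray()
    count, value = 0, None
    for v in values:
        if v == value and count < 255:
            count += 1                      # extend the current run
        else:
            if count > 0:
                out += pack("BB", count, value)   # write the finished run
            count, value = 1, v             # start a new run
    if count > 0:
        out += pack("BB", count, value)     # write the last run
    return bytes(out)

def unrle_bytes(data):
    """Expand packed (count, value) pairs back into a list of values."""
    values = []
    for k in range(0, len(data), 2):
        count, value = unpack("BB", data[k:k + 2])
        values.extend([value] * count)
    return values

row = [255] * 300 + [210] * 100     # 300 'white' values, then 100 'green' ones
encoded = rle_bytes(row)

print(list(encoded))
print(len(row), len(encoded))
print(unrle_bytes(encoded) == row)
```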

The more complex the data is, in this case meaning the more distinct values the data can 
take, the less useful this encoding method will be. In some cases it can make the file larger 
than the raw data would have been. In the case of this particular image, the run-length encoded 
file is about 6K bytes, as opposed to half a million bytes that would have been needed for the 
raw image, saved as pixels. Still, this serves as a basic proof that it is possible to compress a 
data file without losing any information. There are, of course, many more algorithms that will 
compress data to a greater extent and with fewer constraints. 


Huffman Encoding 


If a typical text file is examined carefully, it can be found that the vast majority of the file 
consists of relatively few characters. As a general estimate, over 95% of the characters can be 
accounted for by between 25 and 30 distinct values. A coding scheme that took this into account 
would reduce the size of a text file, and perhaps it would generalize to other kinds of file. For 
example, in many files the value 0 is the most common, and giving it a smaller representation 
than, say, 9 may reduce the overall file size. 

This is not really a novel idea. The international Morse code is based on this idea and has 
been around for a long time, beginning in 1836. The most commonly used letters in English are 
shown in Table 10.1. In the Morse code the letter “E” is represented by a single dot, the letter 
“T” is a single dash, and “A” is a dot followed by a dash. In other words the most common 
letters have the smallest code representation, as a general rule. This is how the Huffman code 
is organized too. 


Table 10.1 
Frequency of Letters in English Text 


Letter  Frequency %    Letter  Frequency %    Letter  Frequency %    Letter  Frequency % 

E       12.5           R       6.1            F       2.3            K       0.7 


A Huffman code is constructed from the ground up, like a wall. The lower levels of the wall 
represent the least frequently used symbols, and have the greatest number of bricks above them. 
The final code will be binary numbers, and the length of the code in bits for a symbol is related 
to the number of bricks above it. The wall is actually shaped like a pyramid, and is called a 
binary tree by computer science folks. It’s a very useful structure in general, but the description 
will be restricted here to its use in Huffman codes. 


As an example, consider the English text: 


I think that at that time none of us quite believed in the Time Machine! 


The characters occur in this particular text with the following frequencies: 


t 10     a 4     q 1     k 1 
e 9      m 3     c 1     s 1 
i 8      o 2     d 1     l 1 
n 5      u 2     v 1 
h 5      b 1     f 1 


The ‘leaves’ (or nodes) at the bottom of the tree (it is drawn upside-down) contain the 
lowest frequency items, and so are placed first. Each two nodes in the tree will have one node 
above them, straddling them, containing the sum of the frequencies of all nodes below. All 
characters are turned into nodes, and each also contains the number of occurrences of that 
letter. This collection of nodes will be called a heap. Initially all have only one character, but 
this will change. 


The rule in building the tree is to pick the pair of nodes (initially characters) that sum to the 
smallest number and connect them using another node, one above them that has a left and right 
node. The first bricks, alphabetically, would be “b” and “c” both with a frequency of 1. The 
first two would look like this: 


Figure 10.2 
A step in the Huffman algorithm, lowest level 


The bottom nodes have characters and counts. The one above has only a count, and it is the 
sum of the counts of the two nodes it is connected to. This new node, with a count of 2, is 
placed back in the heap and the nodes for B and C are removed. The heap will always get 
smaller. 


Repeating this process with the others, the smallest pair we can make is with “d” and “f,” 
then “k” and “l,” and then “q” and “s.” At that point the smallest node is “v” with a count of 1, 
but there are no more nodes with a count of 1. The smallest sum is 3, which uses “v” and “o”: 


Figure 10.3 
Entire Huffman bottom level complete 


All of these are in the heap, and a search is done for the smallest sum of nodes. The 
character “u” has a count of 2 and so do any of the nodes above that link to two other 
characters. These are nodes too, so link “u” with the leftmost node above to get a bigger 
grouping. This is called a subtree, because it is a tree, but it is also part of a bigger tree. The 
“u” node and the other give a sum of 4: 


Figure 10.4 
The first 3 deep tree section: U B C 


The tree that is being built has the least commonly used characters placed at a greater distance 
from the top of the tree than are the frequently used characters. This distance will be used to 
construct the codes, smaller for common characters. Now the smallest sum of two nodes in the 
tree is 4, from the nodes connecting “d,” “f,” “k,” and “l”: 


Figure 10.5 
Next step in the Huffman algorithm: D F K L 


The method here takes the smallest two nodes, which are going to create the smallest sum, 
and connects them, removing the original nodes and replacing them with the new one. The 
smallest nodes now are the node connecting “q” and “s” (value 2), the node with “m” (value 
3) and the node connecting “v” and “o” (value 3). The node with “m” will be selected to link 
to the 2-valued node. The tree is a disconnected collection of nodes, but right now looks like 
this: 


Figure 10.6 
Three lower levels complete. 


plus all of the unconnected nodes for individual letters. So, what’s next? The smallest valued 
character remaining is ‘a’ at 4. That would make the smallest sum 7 after connecting it with the 
subtree on the right (‘v’ and ‘o’). Next in the heap are the two 4-nodes above to create an 8, 
and linking ‘h’ (5) and ‘n’ (also 5) to get a 10: 


Figure 10.7 
The next level of the Huffman tree complete. 


The pattern should be clear by now. Notice that the nodes with nothing below them always 
consist of characters, and the nodes above have only numbers. But oops, the space characters 
were not counted, and they must be for the message to make any sense. There are 14 spaces in 
the message. The final sum will be 14+9 for the space. A node for a space has to be added to 
the heap. 


Figure 10.8 
The final tree 


The last two steps don’t involve any new characters, but they will link all of the nodes 
together and make them accessible from one single node at the top. The final (top) node should 
have a value that is the length of the original string. 


Now comes the bottom line: what was the point of all this? The tree that has been 
constructed will be used to construct the codes for each letter, and the length of each code will 
be the number of nodes between the characters and the top (root) of the tree. The path to each 
left node is labeled with a digit, in this case a 0, and the path to the right nodes is labelled with 
a 1, as in the tree above. The code for any character is read off of the links that were followed 
to get from the top of the tree to the node containing the character. So the space character, the 
most common one, is reached by going left two times; its code will be “00.” The “t” is the 
second most frequent character, and is reached from the top node by going right, then left, then 
right; the code is “101.” The complete set of codes is: 


' '  00        A  11111      D  011100     V  111100 
T    101       M  11101      F  011101 
E    100       U  01100      K  011110 
I    010       O  111101     L  011111 
H    1100      B  011010     Q  111000 
N    1101      C  011011     S  111001 


The coded message is the concatenation of all of the codes for the characters in the order 
they appear in the message. The encoded message would read (an underscore marks a space): 


I   _  t   h    i   n    k      _  t   h    a     t   _  a     t   _ 
010 00 101 1100 010 1101 011110 00 101 1100 11111 101 00 11111 101 00 

t   h    a     t   _  t   i   m     e   _  n    o      n    e   _ 
101 1100 11111 101 00 101 010 11101 100 00 1101 111101 1101 100 00 

o      f      _  u     s      _  q      u     i   t   e   _  b 
111101 011101 00 01100 111001 00 111000 01100 010 101 100 00 011010 

e   l      i   e   v      e   d      _  i   n    _  t   h    e   _ 
100 011111 010 100 111100 100 011100 00 010 1101 00 101 1100 100 00 

T   i   m     e   _  M     a     c      h    i   n    e 
101 010 11101 100 00 11101 11111 011011 1100 010 1101 100 


This amounts to 260 bits = 33 bytes. The original string is 71 bytes long, so the compressed 
data is 46% of the size of the original data. The Huffman coded string is broken into 8-bit bytes 
and transmitted that way: 


01000101 11000101 10101111 00010111 00111111 01001111 
11010010 11100111 11101001 01010111 01100001 10111110 
11101100 00111101 01110100 01100111 00100111 00001100 
01010110 00001101 01000111 11010100 11110010 00111000 
00101101 00101110 01000010 10101110 11000011 10111111 
01101111 00010110 1100 
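
The encoding step itself is a table lookup per character. The sketch below applies the code table given above to the message (written in uppercase, without the final exclamation mark) and splits the result into 8-bit groups; the variable names are chosen for this illustration.

```python
table = {
    " ": "00",     "A": "11111",  "D": "011100", "V": "111100",
    "T": "101",    "M": "11101",  "F": "011101",
    "E": "100",    "U": "01100",  "K": "011110",
    "I": "010",    "O": "111101", "L": "011111",
    "H": "1100",   "B": "011010", "Q": "111000",
    "N": "1101",   "C": "011011", "S": "111001",
}

message = "I THINK THAT AT THAT TIME NONE OF US QUITE BELIEVED IN THE TIME MACHINE"

# Concatenate the code for every character, in message order.
bits = "".join(table[ch] for ch in message)

# Break the bit string into 8-bit groups; the last group may be short.
groups = [bits[k:k + 8] for k in range(0, len(bits), 8)]

print(len(bits), "bits in", len(groups), "bytes")
for k in range(0, len(groups), 6):
    print(" ".join(groups[k:k + 6]))
```

In a real file the final short group would be padded to a full byte, and the decoder would need to know how many bits are valid.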


Decoding requires the table or the tree. If a known table is used, such as the natural 
frequencies of English letters, then it would not have to be transmitted along with the message. 
The use of a Python dictionary type makes the program for decoding very elegant indeed. 
Given the table and the message, bits are removed from the beginning of the message and 
placed into a code string until they match one of the codes in the table. The Huffman code has 
the property that no code is the beginning (prefix) of any other code, so the bit sequences can 
be separated unambiguously even when they are appended as one long message. The first bit 
sequence that matches a code will be the code for the first letter in the message. 


# Huffman decode 
# This is the coded message: 
bitstring = \
    "010001011100010110101111000101110011111101001111" + \
    "110100101110011111101001010101110110000110111110" + \
    "111011000011110101110100011001110010011100001100" + \
    "010101100000110101000111110101001111001000111000" + \
    "001011010010111001000010101011101100001110111111" + \
    "01101111000101101100" 

table = {}                  # This is the table of codes 
table["00"] = " " 
table["11111"] = "A" 
table["011100"] = "D" 
table["111100"] = "V" 
table["101"] = "T" 
table["11101"] = "M" 
table["011101"] = "F" 
table["100"] = "E" 
table["01100"] = "U" 
table["011110"] = "K" 
table["010"] = "I" 
table["111101"] = "O" 
table["011111"] = "L" 
table["1100"] = "H" 
table["011010"] = "B" 
table["111000"] = "Q" 
table["1101"] = "N" 
table["011011"] = "C" 
table["111001"] = "S" 

# Pull bits from the string making a substring until the 
# substring is found in the dictionary. Then emit the 
# character indexed. 
# 
while len(bitstring) > 0:           # Loop until all bits are used 
    code = ""                       # Clear the current code 
    # While code NOT in the dictionary ... 
    while not (code in table): 
        # Add the next bit from the message 
        code = code + bitstring[0] 
        # Remove that bit from the message 
        bitstring = bitstring[1:] 
    # When the code matches, print the character corresponding 
    # to the code 
    print(table[code], end="") 


Like many algorithms, LZW compression is named after the people who devised it: A. 
Lempel, J. Ziv, and Terry Welch. It has been the standard for data compression for many years, 
it was the method used in the GIF file format, and was used in many versions of PDF. It is not 
the most effective method of compression, but it is lossless and efficient. Like the Huffman 
code, LZW creates a table from the original text and uses the codes in the table to perform the 
compression. Unlike the Huffman code, the decompression stage does not require that the table 
be known in advance; it builds the table as it decompresses the file. The LZW algorithm also 
replaces multiple characters with single codes, thus increasing the compression rate. 


LZW compression usually begins with a known code table, most often the 256 ASCII 
characters, but any table known by the compressor and decompressor will work. As an 
example, another short section of text from The Time Machine will be compressed: 


The Time Traveller for so it will be convenient to speak of him was expounding a 
recondite matter to us His grey eyes shone and twinkled and his usually pale face was 
flushed and animated The fire burned brightly and the soft radiance of the incandescent 
lights in the lilies of silver caught the bubbles that flashed and passed in our glasses. 


Punctuation has been removed for simplicity. The algorithm begins with a table of 
characters, in this instance the ones that appear in the quote, but in general the table can contain 
any starting set of symbols. This is called the code table, and associates a numerical code with 
a string. The code table in this case will consist of the letters (uppercase) and their values 
starting with 0: “A”=0, “B”=1, and so on. The space has to be included as well, and is given 
the code 26. The code sequence 0 2 4 would be the string “ACE” using this scheme. 


Naturally there has to be more to this if it is to be a viable compression method. When 
encoding, the characters are examined one at a time and appended to an input string, which is 
looked up in the table. If the string is found in the table, then the next character is read and 
appended to the string and it is looked up again. This repeats until the string is not found, at 
which point a few things happen: the code for the last string that was found is written to the 
output, the new string that was encountered in the text but not found in the table is added to 
the table, and the process continues using the last character read in. This means that not only 
characters but also short strings that occur in the text will have numeric codes, and that the 
table will be created from the text that was given. 


Consider the text in the example: The first character seen is “T”: 


1. “T” exists in the table already, so a new character is read in and appended to the “T” to 
create the pair “TH.” 
2. “TH” is not in the table. The character “T” has the code 19, so 19 is written to the 
output file. 
3. The string “TH” is added to the table. It will be code 27. 
4. The input string is now “H.” 
5. The character “H” is in the table and has code 7. The next character is read in and 
appended to “H” creating “HE.” 
6. “HE” is not in the table, so the code for character “H,” which is 7, is written to the 
output file. 
7. The string “HE” is added to the table, code 28. 
8. The input string is now “E.” 


The process repeats. If a multiple-character string is found in the table, then the steps are 
basically the same. Hypothetically: 


1. The character “T” is next and is in the table. Read the next character “H” and append to 
“T” to get “TH.” 
2. “TH” is in the table. Read the next character “E” and append to “TH” to get “THE.” 
3. “THE” is not in the table, so emit the code for “TH,” which is 27, and add “THE” to the 
table. 
4. Input string is now “E.” 


Step 1 repeats until a string is obtained that has not been seen before. In the example here the 
first 27 codes are letters and the space character. The next few codes are: 


"TH"  27      "HE"   28      "E "  29 
" T"  30      "TI"   31      "IM"  32 
"ME"  33      "E T"  34      "TR"  35 


The first 3-character string (trigram) in the table is “E T.” 
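
The steps above can be sketched as a function that works on a string instead of a file. The name lzw_encode and the choice to return the finished table alongside the codes are assumptions made for this illustration; the starting table is the 26 uppercase letters followed by the space.

```python
def lzw_encode(text):
    """Return (codes, table) for text made of letters and spaces only."""
    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "
    table = {ch: code for code, ch in enumerate(alphabet)}   # codes 0..26
    codes = []
    s = ""
    for ch in text.upper():
        if s + ch in table:
            s = s + ch                    # keep growing the current string
        else:
            codes.append(table[s])        # emit the code for the known string
            table[s + ch] = len(table)    # the new string gets the next code
            s = ch                        # continue from the last character
    if s:
        codes.append(table[s])            # emit whatever is left at the end
    return codes, table

codes, table = lzw_encode("The Time Traveller")

print(codes)
print(table["TH"], table["E T"], table["TR"])
```

With such a short input almost every code is a single letter; the savings appear only once longer strings start to repeat.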

Python’s dictionary type is especially valuable for coding the LZW algorithm. The facility 
for looking up a string in a table is exactly what is required here. The critical part of the 
program could be written as follows: 

# count is the next unassigned symbol 
# ch is the last character read in 
# s is the current character string 
# inf is the input file (text) 

s = ""                                  # Initial string is empty. 
ch = inf.read(1).upper()                # Read the first character, upper case. 
while len(ch) > 0:                      # While the file still has data... 
    if s+ch in dict:                    # Is string concatenated with ch in the table? 
        s = s + ch                      # Yes. Concatenate and repeat 
    else:                               # No. 
        print(dict[s], " ", end="")     # Print the code for the string s 
        dict[s+ch] = count              # Put the new string into the dictionary 
        count = count + 1               # New code is next integer. 
        s = ch                          # String is now the last character read. 
    ch = inf.read(1).upper()            # Read a new character 
if len(s) > 0:                          # At the end of the file, print the code 
    print(dict[s])                      # for whatever string remains. 

When decoding the LZW file, the initial table is known. Again, this is often just the ASCII 
characters but can be something else, and in this case is the letters plus the space. The file 
contains codes, not characters, but the codes are in the table, right? No, only the starting codes 
are in the table. So decoding the message in the example starts easily. The first few codes in 
the message are: 


19 7 4 26 19 8 12 29 19 17 0 21... 


The first code is read in and is the code for the letter “T.” This is followed by 7 (“H”) and 4 
(“E”) and so on until the code 29 is reached. There is no entry for the code 29 in the initial 
table. This is where the really clever part of the LZW algorithm happens. 


When decoding, the program builds the table again. After all, the characters are in the 
same order in the encoded data, so it should be possible to reproduce the process that was 
used to build the code table in the first place. When the first code is read in, the code is 
expected to be in the table, and the corresponding letter “T” is written and placed into a string. 
The next code is read and corresponds to “H.” Now “TH” is added to the dictionary, and “H” 
is written and becomes the current string. Now “E” is seen, “HE” is added to the table, and 
“E” is written, and so on. Again a dictionary can be used to store the codes, but a list is more 
efficient. The indices are codes, which are numbers, so a list is fine here. The central part of 
the process is: 

code1 = int(inf.readline())         # CODE1 is the first code on the file 
print (dict[code1], end="")         # Output the string for CODE1 
ch = dict[code1][0]                 # CH starts as the first character of that string 
while True:                         # While more codes on the file 
    line = inf.readline() 
    if line.strip() == "":          # No more codes: done 
        break 
    code0 = int(line)               # CODE0 is the next code on the file 
    if code0 < len(dict):           # Is CODE0 in the table? 
        s = dict[code0]             # YES. S is the string for CODE0 
    else: 
        s = dict[code1]             # NO. S is the string for CODE1 
        s = s + ch                  # Append CH to S. 
    print (s, end="")               # IN EITHER CASE emit S 
    ch = s[0]                       # CH becomes the first character of S 
    dict.append(dict[code1]+ch)     # Add new string to the table 
    count = count + 1 
    code1 = code0 
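As with the encoder, the decoding loop can be packaged as a small function that works on a list of codes instead of a file. This is a sketch under the same assumptions (a starting table of A to Z plus the space, coded 0 to 26), and the name lzw_decode is invented here; it is not the program from the disc.

```python
def lzw_decode(codes):
    """Rebuild the text from a list of LZW codes."""
    table = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ ")   # index = code, value = string
    if not codes:
        return ""
    code1 = codes[0]                    # previous code
    out = [table[code1]]                # the first code is always in the table
    ch = table[code1][0]
    for code0 in codes[1:]:             # code0 is the code just read
        if code0 < len(table):
            s = table[code0]            # known code: look it up
        else:
            s = table[code1] + ch       # not yet in the table: build it
        out.append(s)
        ch = s[0]
        table.append(table[code1] + ch) # same entry the encoder created
        code1 = code0
    return "".join(out)

print(lzw_decode([19, 7, 4, 26, 19, 8, 12, 29, 19, 17, 0, 21]))
# THE TIME TRAV
```

The else branch is only needed when the encoder uses a code immediately after creating it, as happens with repeated characters: the codes [0, 27] decode to “AAA” even though 27 is not in the table when it is read.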


A pseudo-code summary of both the encoding and decoding processes is given in Figure 
10.9, and working programs are provided on the disc (lzwe.py and lzwd.py). If punctuation is 
to be added, then a different conversion to uppercase would have to be done. For practical 
applications, the entire ASCII character set would be used at the outset. 


LZW Encode 

1. Read the first character from the input file as string S. 
2. Read the next character from the input file as string C. 
3. Is S+C in the table? 
   YES: S = S+C 
   NO:  Emit the code for S 
        Add S+C to the table 
        S = C 
4. More characters on the file? 
   YES: Go back to step 2. 
   NO:  Emit the code for S. 


LZW Decode 

1. Read the first code from the input file, call it code1. 
2. Output the string corresponding to code1. 
3. Read the next code, call it code0. 
4. Is code0 in the table? 
   YES: Let S be the string for code0 
   NO:  Let S be the string for code1 
        Let S = S+C 
5. Emit S. 
6. C = first character of S. 
7. Add the code1 string + C to the table. 
8. code1 = code0 
9. More codes? 
   YES: Go back to step 3. 
   NO:  We're done. 


Figure 10.9 

The LZW encode and decode algorithms. 


HASHING 


A hashing algorithm attempts to characterize a complex piece of data with something 
simpler, and preferably unique. The most common example would be to find a number that 
could represent a character string. A hashing algorithm has to be fast, because the idea very 
often is to convert a string into an index to a list or tuple. Consider the string “while.” There 
are five characters (bytes) here. How can this string be used as an index into a tuple? 


Any numerical operation on the codes used to represent the characters might work, but some 
result in codes that are too large. Simply adding the codes would give a value of 537, which 
could work but also might be too large. Imagine the application is to look up Python 
keywords; there are 33 of them. The value resulting from the hash should be an index between 0 
and 32, so take the hash mod 33. If that is tried, about half of the 33 entries will be 
empty, and many of the others will have two or more strings that have the same index. The 
result is: 


 4: “None”        12: “return”     22: “is” 
 6: “class”       14: “as”         25: “finally” 
 7: “from”        15: “lambda”     27: “or” 
 9: “while”       17: “in”         29: “False” 
10: “and”         20: “True”       30: “for” 
11: “continue”    21: “try”        31: “global” 


When two strings hash to the same value, the result is called a collision. In this case the 
collisions are: 


(class, def) (False, nonlocal) (return, del) (from, not) (lambda, with) 
(True, elif) (while, if) (from, yield) (global, assert) (False, else) 
(from, import) (and, pass) (is, break) (is, except) (None, raise) 
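The collision list can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the names sum_hash and collisions are invented here, and the keyword list is typed in by hand to match the 33 keywords the text counts (newer Python versions report more through keyword.kwlist).

```python
# The 33 keywords assumed by the text (hand-typed; an assumption).
KEYWORDS = ["False", "None", "True", "and", "as", "assert", "break", "class",
            "continue", "def", "del", "elif", "else", "except", "finally",
            "for", "from", "global", "if", "import", "in", "is", "lambda",
            "nonlocal", "not", "or", "pass", "raise", "return", "try",
            "while", "with", "yield"]

def sum_hash(s, size):
    """Add the character codes of s and reduce the total modulo size."""
    return sum(ord(c) for c in s) % size

def collisions(words, size, hash_fn=sum_hash):
    """Map each index claimed by two or more words to the list of those words."""
    cells = {}
    for w in words:
        cells.setdefault(hash_fn(w, size), []).append(w)
    return {i: ws for i, ws in cells.items() if len(ws) > 1}

print(sum_hash("while", 33))        # 537 mod 33 = 9
for index, words in sorted(collisions(KEYWORDS, 33).items()):
    print(index, words)
```

Running it groups, for example, from, import, not, and yield at index 7, which corresponds to the pairs (from, not), (from, yield), and (from, import) listed above.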


Two values can’t occupy the same location in a tuple, so something must be done. The 
simplest way to deal with collisions is to have extra space in the list or tuple. If the size of the 
tuple is specified as 145, then the strings hash to distinct values, with one exception: “True” 
and “elif” have the same character sum (416), so they collide at any table size. Of course, now 
113 tuple entries are empty, but does that really matter? The alternative to a table indexed by 
hashing (a hash table) would be a list that has to be searched, and hashing is very much faster. 


As it happens, simply adding the characters together is not a very good 
hashing method. A few well-known methods do better. 


djb2 
This algorithm starts with a predefined seed for a hash value, multiplies it by 33 and adds 
the next character from the string, multiplies that by 33, adds the next character, and so on. The 
code is: 


def djb2 (s, size): 
    sum = 5381 
    for i in range (0, len(s)): 
        sum = sum*33 + ord(s[i]) 
    sum = sum%size 
    return sum 

Why multiply by 33? It works well, and nobody knows why. The seed of 5381 can be 
changed to see how different values work. With the configuration given here, there will need to 
be 112 elements in the tuple to avoid collisions. If the program is changed slightly so that an 
exclusive OR replaces the sum, the size decreases to 105. That is: 


        sum = sum*33 ^ ord(s[i]) 
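One way to experiment with the seed and with the exclusive OR variant is to search for the smallest table size at which a given set of strings hashes without collisions. The sketch below is illustrative only: the helper smallest_collision_free_size, the seed and use_xor parameters, and the hand-typed keyword list are assumptions rather than part of the original program, and the sizes it reports depend on exactly which keywords are included.

```python
# The 33 keywords assumed by the text (hand-typed; an assumption).
KEYWORDS = ["False", "None", "True", "and", "as", "assert", "break", "class",
            "continue", "def", "del", "elif", "else", "except", "finally",
            "for", "from", "global", "if", "import", "in", "is", "lambda",
            "nonlocal", "not", "or", "pass", "raise", "return", "try",
            "while", "with", "yield"]

def djb2(s, size, seed=5381, use_xor=False):
    """djb2 hash of s reduced modulo size; use_xor selects the XOR variant."""
    h = seed
    for c in s:
        if use_xor:
            h = (h * 33) ^ ord(c)   # exclusive OR replaces the addition
        else:
            h = h * 33 + ord(c)
    return h % size

def smallest_collision_free_size(words, hash_fn, limit=5000):
    """Smallest size up to limit that gives every word its own index, else None."""
    for size in range(len(words), limit + 1):
        if len({hash_fn(w, size) for w in words}) == len(words):
            return size
    return None

print(smallest_collision_free_size(KEYWORDS, djb2))
print(smallest_collision_free_size(
    KEYWORDS, lambda w, n: djb2(w, n, use_xor=True)))
```

Passing a different seed, for example lambda w, n: djb2(w, n, seed=0), shows how sensitive the required table size is to that choice.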


sdbm 


This is a method devised for scrambling bits, but makes for a good hashing function. The 
iteration is hash(i) = hash(i - 1) * 65599 + str[i]. The number 65599 is arbitrary, but happens 
to be prime. A function to implement this is: 


def sdbm (s, size): 
    hash = 0 
    for i in range (0, len(s)): 
        hash = hash * 65599 + ord(s[i]) 
    return hash%size 


There are many other hashing methods (see: Knuth). The idea is an important one. It is, for 
example, a way to implement Python dictionaries: hash the key to an integer and use that to 
access the value. 
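To make the dictionary idea concrete, here is a minimal sketch of a table that stores key/value pairs in cells chosen by sdbm. It handles collisions with separate chaining (each cell holds a small list of pairs), a technique this section does not cover, and the class name TinyDict and its methods are invented for the illustration. Python's real dictionary type differs in its details.

```python
def sdbm(s, size):
    """sdbm hash of s reduced modulo size."""
    h = 0
    for c in s:
        h = h * 65599 + ord(c)
    return h % size

class TinyDict:
    """Hypothetical string-keyed table; each cell is a list of (key, value) pairs."""

    def __init__(self, size=64):
        self.size = size
        self.cells = [[] for _ in range(size)]

    def put(self, key, value):
        cell = self.cells[sdbm(key, self.size)]    # hash the key to pick a cell
        for i, (k, _) in enumerate(cell):
            if k == key:                           # key already stored: replace it
                cell[i] = (key, value)
                return
        cell.append((key, value))                  # new key (or a collision)

    def get(self, key, default=None):
        for k, v in self.cells[sdbm(key, self.size)]:
            if k == key:
                return v
        return default

d = TinyDict()
d.put("while", "loop keyword")
d.put("if", "conditional keyword")
print(d.get("while"))     # loop keyword
print(d.get("unless"))    # None
```

Only the keys that land in one cell are compared during a lookup, which is why a well-spread hash makes access fast regardless of how many items are stored.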


SUMMARY 


The goal of this chapter was to introduce important algorithms or general techniques used in 
computer science. Sorting is a traditional programming problem for undergraduates and is 
essential in many data-handling applications. The selection sort and the merge sort were 
discussed at length. 


Searching involves finding some piece of data within a larger collection. A linear search 
starts at the beginning and looks at consecutive elements until the target is found. A binary 
search splits the data into two halves each time an element in the set is examined and so is 
faster, but it depends on the data being sorted. 


Random number generation creates a sequence of numbers that satisfies a statistical test for 
randomness. Such numbers are crucial in computer simulations and games, and in some 
numerical algorithms. 


Cryptography involves sending messages that only certain intended people can receive and 
understand. A cipher is an algorithm that converts one string of characters into another one of 
generally the same length. The one-time pad method was examined, followed by the very 
popular RSA algorithm. 


Data compression is about ways to take many bytes of information and turn them into fewer 
bytes while losing none of the essential message. Of course, compressed data is 
incomprehensible just to look at and must be decompressed in order for it to be used. This 
section demonstrated run length encoding, Huffman codes, and the LZW algorithm. 


The final section was a brief discussion of hashing, a way to reduce strings or other 
complex data types to simpler forms such as integers. The djb2 and the sdbm 
methods were singled out as being typical of the way that such algorithms work. 


Exercises 


1. Hashing algorithms must be fast. Use the timing schemes discussed in this chapter to 
   determine which of the three hashing algorithms presented is the fastest. 


2. When a sequence of numbers is sorted into ascending order then element i-1 is always 
   smaller than or equal to element i. Here is a description of a sorting algorithm: scan the 
   data set S to find any pairs of adjacent locations where S[i-1] > S[i], and when any are 
   found swap the two values. Repeat the process until the array is sorted. Does it ever get 
   sorted? What is the best case and what is the worst case? Implement the method in Python. 


3. Compare the linear congruential random number generator described in this chapter against 
   the random() function in Python. Implement a die roll using each method, and roll a die 
   1000 times. Which method is nearest to the expected frequency distribution (equal for all 
   values)? Repeat the process 1000 times and score Python one point when its random 
   number generator wins by this measure, and score the book’s generator one point when it 
   wins. Which is the overall winner? 


4. The quality of a hashing algorithm is measured by how random the hash codes are when 
   given a sample set of strings. One estimate of randomness is the number of cells with more 
   than one value hashed to it (the best here would be 0), and the average number of values 
   hashed to occupied cells; this should be close to 1. Measure these for the three hashing 
   methods presented for a size of 60 cells. 


5. Data for registrants in a swimming competition consists of the swimmer’s name, number, 
   national ranking, and time in the 200-meter freestyle competition. These data are located in 
   four lists: name, number, rank, t200. In all cases the same index is used to access all of 
   the data for the same person. Sort these data in descending order on time and identify the 
   persons in the top three spots and their times. 


6. Steganography works by concealing a message rather than making it unreadable, as is 
   done when using encryption. In the ideal situation nobody will even suspect that there is a 
   second message hidden within the first. Consider a scheme that uses the spaces in a 
   message: a single space is a “0” and a double space is a “1.” The letters are coded as 5-bit 
   codes starting with “A” = 00000, “B” = 00001, and so on. Write programs that encode and 
   decode such messages. 


Random.org random number server. https://www.random.org/ 